mean_absolute_percentage_error (MAPE)#
Mean Absolute Percentage Error (MAPE) measures the average relative size of the errors:
“On average, how far off am I, relative to the true value?”
Because it is unitless, it’s common in forecasting and business settings.
Important caveat: MAPE is undefined when a true target is 0 (and unstable when targets are near 0).
Learning goals#
By the end you should be able to:
define MAPE precisely (and relate it to MAE)
build intuition for “relative error” with plots
implement MAPE from scratch in NumPy (weights + multi-output + safe handling of zeros)
optimize a simple linear regression model by minimizing MAPE with subgradient descent
understand pros/cons and when to use MAPE
Quick import#
from sklearn.metrics import mean_absolute_percentage_error
Prerequisites#
absolute error / residuals
basic linear regression notation
gradients / subgradients (helpful but not required)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
1) Definition#
Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) be predictions. Define the per-sample absolute percentage error (APE):

$$\mathrm{APE}_i = \frac{|y_i - \hat{y}_i|}{|y_i|}$$

Then the mean absolute percentage error is:

$$\mathrm{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}$$

In practice we usually guard against division by zero and sign ambiguity by using:

$$\mathrm{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{\max(\varepsilon, |y_i|)}$$
This is the definition used by scikit-learn. Note:
scikit-learn returns a relative value in \([0, \infty)\) (e.g. 0.23), not a percentage in \([0, 100]\)
to convert to “percent”, multiply by 100
# A tiny example
y_true = np.array([10.0, 100.0, 50.0, 25.0])
y_pred = np.array([12.0, 102.0, 55.0, 20.0])
ape = np.abs(y_true - y_pred) / np.abs(y_true)
mape = float(ape.mean()) # relative value (0.0 = perfect)
mape, 100 * mape, mean_absolute_percentage_error(y_true, y_pred)
(0.13, 13.0, 0.13)
2) Intuition: “2 units” means different things#
MAPE normalizes each absolute error by the true value.
an absolute error of 2 when the true value is 10 is a 20% error
the same absolute error of 2 when the true value is 100 is a 2% error
So MAPE tends to care a lot about getting small targets right.
y_true = np.array([10.0, 100.0])
y_pred = np.array([12.0, 102.0]) # same absolute error (2) in both cases
abs_err = np.abs(y_true - y_pred)
ape_pct = 100 * abs_err / y_true
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Absolute error (units)", "Absolute percentage error (%)"),
)
fig.add_trace(go.Bar(x=["y=10", "y=100"], y=abs_err, name="abs error"), row=1, col=1)
fig.add_trace(go.Bar(x=["y=10", "y=100"], y=ape_pct, name="% error", marker_color="#E45756"), row=1, col=2)
fig.update_yaxes(title_text="|y - ŷ|", row=1, col=1)
fig.update_yaxes(title_text="100×|y - ŷ| / y", row=1, col=2)
fig.update_layout(height=350, title="Same absolute error, very different relative error")
fig.show()
3) MAPE is a weighted MAE (and it’s scale-free)#
Rewrite the safe definition as:

$$\mathrm{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} w_i \, |y_i - \hat{y}_i|, \qquad w_i = \frac{1}{\max(\varepsilon, |y_i|)}$$

So MAPE is just MAE with per-sample weights that are larger when \(|y_i|\) is small.
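We can check this identity numerically. One subtlety: scikit-learn’s mean_absolute_error normalizes a weighted average by \(\sum_i w_i\) rather than \(n\), so we rescale by the mean weight (a small sketch; the variable names are ours):
# MAPE == mean of w_i * |y_i - yhat_i| with w_i = 1 / max(eps, |y_i|)
eps = np.finfo(np.float64).eps
y_true = np.array([10.0, 100.0, 50.0, 25.0])
y_pred = np.array([12.0, 102.0, 55.0, 20.0])
w = 1.0 / np.maximum(np.abs(y_true), eps)
lhs = mean_absolute_percentage_error(y_true, y_pred)
# sklearn's weighted MAE divides by sum(w); multiplying by w.mean() restores the 1/n scaling
rhs = mean_absolute_error(y_true, y_pred, sample_weight=w) * w.mean()
lhs, rhs  # identical up to floating point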
A key property is scale invariance: for any \(c \neq 0\),

$$\mathrm{MAPE}(c\,y,\; c\,\hat{y}) = \mathrm{MAPE}(y, \hat{y})$$

(MAE, by contrast, scales by \(|c|\): \(\mathrm{MAE}(c\,y, c\,\hat{y}) = |c|\,\mathrm{MAE}(y, \hat{y})\).)
y_true = rng.lognormal(mean=2.0, sigma=0.8, size=200)
y_pred = y_true * (1.0 + rng.normal(0.0, 0.15, size=y_true.size))
scales = np.array([0.1, 1.0, 10.0, 100.0])
mape_vals = []
mae_vals = []
for c in scales:
mape_vals.append(mean_absolute_percentage_error(c * y_true, c * y_pred))
mae_vals.append(mean_absolute_error(c * y_true, c * y_pred))
fig = go.Figure()
fig.add_trace(go.Scatter(x=scales, y=mape_vals, mode="lines+markers", name="MAPE (relative)"))
fig.add_trace(go.Scatter(x=scales, y=mae_vals, mode="lines+markers", name="MAE (units)", yaxis="y2"))
fig.update_xaxes(title_text="scale factor c", type="log")
fig.update_yaxes(title_text="MAPE", rangemode="tozero")
fig.update_layout(
title="Scale invariance: MAPE stays the same when you scale the target",
height=380,
yaxis2=dict(title="MAE", overlaying="y", side="right", showgrid=False),
)
fig.show()
4) Loss shape: why small targets dominate#
For a single sample with \(|y|>0\), the MAPE loss as a function of the prediction \(\hat{y}\) is:

$$\ell(\hat{y}) = \frac{|y - \hat{y}|}{|y|}$$
This is a V-shape (like MAE), but its slope is \(1/|y|\).
if \(|y|\) is small, the V is steep → small absolute errors cause large percentage errors
if \(|y|\) is large, the V is shallow → the same absolute error counts less
y_values = [1.0, 10.0, 100.0]
yhat = np.linspace(-20, 140, 600)
fig = go.Figure()
for y in y_values:
loss = np.abs(y - yhat) / np.abs(y)
fig.add_trace(go.Scatter(x=yhat, y=loss, mode="lines", name=f"y={y:g}"))
fig.update_layout(
title="Per-sample MAPE loss vs prediction (different true values)",
xaxis_title="prediction ŷ",
yaxis_title="|y - ŷ| / |y|",
height=380,
)
fig.show()
5) Common pitfalls (and alternatives)#
Pitfalls#
Zero targets: if \(y_i = 0\) then \(|y_i - \hat{y}_i|/|y_i|\) is undefined. scikit-learn substitutes \(\max(\varepsilon, |y_i|)\) in the denominator, which turns division by zero into a huge (but finite) number, as demonstrated after this list.
Near-zero targets: values close to 0 can dominate the average (because they get very large weights).
Negative targets: you can still compute a “relative error” with \(|y_i|\) in the denominator, but the word “percent” can be misleading.
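A quick demonstration of the zero-target pitfall (a tiny sketch; the exact magnitude depends on machine epsilon):
# One zero target: the eps-guard turns |0 - 1| / max(eps, 0) into 1/eps ≈ 4.5e15,
# which completely dominates the average of just two samples.
mean_absolute_percentage_error([0.0, 10.0], [1.0, 10.0])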
Alternatives#
WAPE (weighted absolute percentage error): \(\mathrm{WAPE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}\) (behaves better with zeros unless all targets are zero).
sMAPE (symmetric MAPE): \(\mathrm{sMAPE} = \frac{1}{n}\sum_i \frac{2|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}\) (bounded, but still has edge cases and interpretability quirks; both are sketched in code after this list).
MAE/RMSE when absolute error matters.
RMSLE when relative differences matter and \(y \ge 0\).
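A minimal NumPy sketch of both alternatives under the definitions above (the names wape and smape are ours, not library functions):
def wape(y_true, y_pred):
    # Total absolute error normalized by total |target| (undefined only if all targets are 0)
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))

def smape(y_true, y_pred):
    # Symmetric denominator |y| + |yhat|; each term lies in [0, 2]
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    return float(np.mean(2.0 * np.abs(y_true - y_pred) / denom))

y_a = np.array([0.0, 10.0, 100.0])  # includes a zero target
y_b = np.array([1.0, 11.0, 101.0])
wape(y_a, y_b), smape(y_a, y_b)  # both finite despite the zero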
# How near-zero targets blow up the percentage error
eps = np.finfo(np.float64).eps
y = np.logspace(-6, 1, 500) # from 1e-6 to 10
abs_err = 1.0
ape_no_guard = abs_err / y
ape_with_eps = abs_err / np.maximum(y, eps)
fig = go.Figure()
fig.add_trace(go.Scatter(x=y, y=ape_no_guard, mode="lines", name="1/|y|"))
fig.add_trace(go.Scatter(x=y, y=ape_with_eps, mode="lines", name="1/max(|y|, eps)", line=dict(dash="dash")))
fig.update_xaxes(title_text="|y|", type="log")
fig.update_yaxes(title_text="absolute percentage error", type="log")
fig.update_layout(title="Fixed absolute error (1.0) becomes huge when |y| is tiny", height=380)
fig.show()
6) NumPy implementation from scratch#
Below is a NumPy implementation that mirrors scikit-learn’s behavior:
supports 1D and multioutput targets
supports sample_weight (weights per sample)
supports multioutput aggregation ("raw_values", "uniform_average", or per-output weights)
uses \(\varepsilon = \mathrm{eps}(\text{float64})\) to avoid division by zero
def mean_absolute_percentage_error_np(
y_true,
y_pred,
*,
sample_weight=None,
multioutput="uniform_average",
epsilon=None,
):
"""NumPy implementation of scikit-learn's MAPE.
Returns a relative value (e.g. 0.23), not a percent (23%).
"""
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
if epsilon is None:
epsilon = np.finfo(np.float64).eps
denom = np.maximum(np.abs(y_true), epsilon)
mape = np.abs(y_pred - y_true) / denom
# 1D
if y_true.ndim == 1:
if sample_weight is None:
return float(mape.mean())
w = np.asarray(sample_weight)
if w.shape != (y_true.shape[0],):
raise ValueError(f"sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
return float(np.sum(w * mape) / np.sum(w))
# 2D
if y_true.ndim != 2:
raise ValueError("y_true must be 1D or 2D")
if sample_weight is None:
output_errors = mape.mean(axis=0)
else:
w = np.asarray(sample_weight)
if w.shape != (y_true.shape[0],):
raise ValueError(f"sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
output_errors = np.average(mape, axis=0, weights=w)
if isinstance(multioutput, str):
if multioutput == "raw_values":
return output_errors
if multioutput == "uniform_average":
return float(np.mean(output_errors))
raise ValueError("multioutput must be 'raw_values', 'uniform_average', or array-like")
w_out = np.asarray(multioutput, dtype=float)
if w_out.shape != (y_true.shape[1],):
raise ValueError(f"multioutput weights must have shape {(y_true.shape[1],)}, got {w_out.shape}")
if np.any(w_out < 0):
raise ValueError("multioutput weights must be non-negative")
return float(np.sum(w_out * output_errors) / np.sum(w_out))
# Quick checks vs scikit-learn
# 1D
y_true = rng.normal(size=60)
y_pred = y_true + rng.normal(0, 0.4, size=60)
y_true[[3, 15]] = 0.0 # include zeros to show eps-guard behavior
print("1D")
print("numpy :", mean_absolute_percentage_error_np(y_true, y_pred))
print("sklearn:", mean_absolute_percentage_error(y_true, y_pred))
# Multioutput
Y_true = rng.normal(size=(80, 3))
Y_pred = Y_true + rng.normal(0, 0.5, size=(80, 3))
w = rng.uniform(0.5, 2.0, size=80)
print("\nMultioutput (raw)")
print("numpy :", mean_absolute_percentage_error_np(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("sklearn:", mean_absolute_percentage_error(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("\nMultioutput (weighted)")
weights = np.array([0.2, 0.3, 0.5])
print("numpy :", mean_absolute_percentage_error_np(Y_true, Y_pred, sample_weight=w, multioutput=weights))
print("sklearn:", mean_absolute_percentage_error(Y_true, Y_pred, sample_weight=w, multioutput=weights))
1D
numpy : 189188988502492.4
sklearn: 189188988502492.4
Multioutput (raw)
numpy : [1.65270765 5.3344239 1.11974124]
sklearn: [1.65270765 5.3344239 1.11974124]
Multioutput (weighted)
numpy : 2.490739317496624
sklearn: 2.490739317496624
7) Using MAPE to fit a model (from scratch)#
MAPE is often used as an evaluation metric, but you can also use it as a training objective.
For a linear model:

$$\hat{y} = Xw + b$$

the (safe) MAPE objective is:

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i^\top w + b - y_i|}{\max(\varepsilon, |y_i|)}$$

This is convex, but not differentiable everywhere (because of \(|\cdot|\)). A simple low-level optimizer is subgradient descent.

Let \(r = Xw + b - y\) and \(d_i = \max(\varepsilon, |y_i|)\). One valid subgradient is:

$$\nabla_w L = \frac{1}{n}\, X^\top \frac{\operatorname{sign}(r)}{d}, \qquad \nabla_b L = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{sign}(r_i)}{d_i}$$

where the division \(\operatorname{sign}(r)/d\) is elementwise.
Interpretation: minimizing MAPE is equivalent to minimizing a weighted MAE with weights \(1/d_i\).
# Synthetic regression data with multiplicative (relative) noise
# This is a setting where a relative-error metric like MAPE often makes sense.
n = 500
x = rng.uniform(0, 200, size=n)
X = x[:, None]
# True linear relationship (targets are strictly positive)
y_clean = 20.0 + 5.0 * x # range ~ [20, 1020]
# Multiplicative lognormal noise with mean 1
sigma = 0.35
mult = np.exp(rng.normal(loc=-0.5 * sigma**2, scale=sigma, size=n))
y = y_clean * mult
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train.shape, X_test.shape
((375, 1), (125, 1))
# Baseline: ordinary least squares (minimizes MSE)
ols = LinearRegression().fit(X_train, y_train)
y_pred_ols = ols.predict(X_test)
mape_ols = mean_absolute_percentage_error(y_test, y_pred_ols)
mae_ols = mean_absolute_error(y_test, y_pred_ols)
mse_ols = mean_squared_error(y_test, y_pred_ols)
(ols.intercept_, ols.coef_[0]), (mape_ols, mae_ols, mse_ols)
((12.796215955793969, 5.021166597032021),
(0.3234497534673614, 139.99874478140572, 45111.330009565856))
def fit_linear_regression_mape_subgradient(X, y, *, lr0=500.0, n_iters=4000, epsilon=None):
"""Minimize MAPE for y_hat = X @ w + b using subgradient descent.
Uses a decaying learning rate: lr_t = lr0 / sqrt(t+1).
"""
X = np.asarray(X)
y = np.asarray(y)
n_samples, n_features = X.shape
if epsilon is None:
epsilon = np.finfo(np.float64).eps
# Initialize at w = 0, b = median(y): the median is the MAE-optimal constant,
# a sensible starting point for L1-type losses
w = np.zeros(n_features)
b = float(np.median(y))
denom = np.maximum(np.abs(y), epsilon)
history = np.empty(n_iters)
for t in range(n_iters):
y_hat = X @ w + b
r = y_hat - y
g = np.sign(r) / denom # subgradient wrt y_hat
grad_w = (X.T @ g) / n_samples
grad_b = g.mean()
lr = lr0 / np.sqrt(t + 1)
w -= lr * grad_w
b -= lr * grad_b
y_hat2 = X @ w + b
history[t] = np.mean(np.abs(y_hat2 - y) / denom)
return w, b, history
# Feature scaling helps subgradient methods
x_mean = X_train.mean(axis=0)
x_std = X_train.std(axis=0)
X_train_s = (X_train - x_mean) / x_std
X_test_s = (X_test - x_mean) / x_std
w_mape, b_mape, hist = fit_linear_regression_mape_subgradient(X_train_s, y_train)
# Convert parameters back to the original x scale:
# y_hat = w_s * ((x - mu)/sigma) + b_s = (w_s/sigma) * x + (b_s - w_s*mu/sigma)
slope_mape = w_mape[0] / x_std[0]
intercept_mape = b_mape - slope_mape * x_mean[0]
y_pred_mape = intercept_mape + slope_mape * X_test[:, 0]
mape_mape = mean_absolute_percentage_error(y_test, y_pred_mape)
mae_mape = mean_absolute_error(y_test, y_pred_mape)
mse_mape = mean_squared_error(y_test, y_pred_mape)
(intercept_mape, slope_mape), (mape_mape, mae_mape, mse_mape)
((23.369710958357132, 3.5535421606512725),
(0.28248252581867633, 162.72868717407727, 75607.15743363052))
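As a cross-check on the weighted-MAE interpretation, we can compare against an off-the-shelf solver. This is a sketch assuming scikit-learn ≥ 1.0, whose QuantileRegressor with quantile=0.5 and alpha=0.0 minimizes a weighted MAE; per-sample weights \(1/\max(\varepsilon, |y_i|)\) therefore make it minimize the safe MAPE objective:
# Assumption: QuantileRegressor is available (scikit-learn >= 1.0).
# Median regression + relative weights == minimizing the safe MAPE objective.
from sklearn.linear_model import QuantileRegressor

w_rel = 1.0 / np.maximum(np.abs(y_train), np.finfo(np.float64).eps)
qr = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X_train, y_train, sample_weight=w_rel)
mean_absolute_percentage_error(y_test, qr.predict(X_test))  # should land near mape_mape above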
fig = px.line(
y=100 * hist,
title="Subgradient descent: MAPE objective vs iteration",
labels={"index": "iteration", "y": "train MAPE (%)"},
)
fig.show()
# Fit visualization: data + fitted lines
x_line = np.linspace(X.min(), X.max(), 250)
y_line_true = 20.0 + 5.0 * x_line
y_line_ols = ols.intercept_ + ols.coef_[0] * x_line
y_line_mape = intercept_mape + slope_mape * x_line
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_test[:, 0], y=y_test, mode="markers", name="test data", marker=dict(size=6, opacity=0.55)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_true, mode="lines", name="true line", line=dict(color="green", dash="dash")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_ols, mode="lines", name=f"OLS (MSE) | test MAPE={100*mape_ols:.1f}%"))
fig.add_trace(go.Scatter(x=x_line, y=y_line_mape, mode="lines", name=f"MAPE-trained | test MAPE={100*mape_mape:.1f}%", line=dict(color="#E45756")))
fig.update_layout(title="Optimizing MAPE shifts the fit toward relative accuracy", height=420)
fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="y")
fig.show()
# Distribution of absolute percentage errors on the test set
eps = np.finfo(np.float64).eps
ape_ols = np.abs(y_test - y_pred_ols) / np.maximum(np.abs(y_test), eps)
ape_mape = np.abs(y_test - y_pred_mape) / np.maximum(np.abs(y_test), eps)
fig = go.Figure()
fig.add_trace(go.Box(y=100 * ape_ols, name="OLS (MSE)", boxpoints="outliers"))
fig.add_trace(go.Box(y=100 * ape_mape, name="MAPE-trained", boxpoints="outliers"))
fig.update_yaxes(title_text="absolute % error", type="log")
fig.update_layout(title="Test absolute percentage errors (%)", height=380)
fig.show()
8) Practical usage (scikit-learn)#
MAPE is commonly used for evaluation and model selection.
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_true, y_pred)
For cross-validation, scikit-learn follows a “bigger is better” convention and exposes MAPE as negative MAPE:
cross_val_score(model, X, y, scoring="neg_mean_absolute_percentage_error")
scores = cross_val_score(
LinearRegression(),
X,
y,
scoring="neg_mean_absolute_percentage_error",
cv=5,
)
# Convert back to positive MAPE and percent
mape_cv = -scores
float(mape_cv.mean()), float((100 * mape_cv).mean())
(0.3169360284034368, 31.693602840343686)
Pros, cons, and when to use MAPE#
Pros#
Interpretable: “average relative error” is easy to communicate
Unitless / scale-free: comparable across targets measured in different units or at different scales
Natural for multiplicative noise: when you care about proportional errors (10% off is 10% off)
Cons#
Undefined at 0 and unstable near 0 (can explode and dominate the mean)
Over-weights small targets: optimization behaves like weighted MAE with weights \(1/|y|\)
Awkward with negatives: the “percent” interpretation breaks down if \(y\) can be negative
Good use cases#
forecasting demand/sales/traffic where values are strictly positive and not too close to 0
comparing performance across multiple series with different scales
business metrics where relative deviation matters more than absolute units
Avoid when#
targets can be 0 or near 0 (consider MAE/RMSE, WAPE, or sMAPE)
you care about absolute units (e.g. “within ±5 kWh”)
targets can be negative and “percent” becomes ambiguous
Exercises#
Implement WAPE and compare it to MAPE on a dataset with many zeros.
Show that minimizing MAPE is equivalent to minimizing MAE with per-sample weights \(w_i = 1/\max(\varepsilon, |y_i|)\).
For a constant predictor \(\hat{y}_i \equiv c\), experiment numerically with which \(c\) minimizes MAPE (hint: weighted median).
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html
Hyndman & Athanasopoulos, Forecasting: Principles and Practice (accuracy measures): https://otexts.com/fpp3/accuracy.html